EPILEPTIC SEIZURE RECOGNITION
Data Manipulation
import pandas as pd
raw = pd.read_csv("epileptic-seizure-recognition.csv")
raw = raw.drop("Unnamed",axis=1)
raw.head()
In the data, each of the 178 feature columns is a value of the EEG recording over 1 second. Based on these recordings, people are grouped into 5 different classes, numbered 1 to 5.
Since the detection of epileptic seizures is the main focus of this project, these five groups can be divided into two groups based on the epileptic seizure activity. The people who are grouped in classes 2,3,4, and 5 are the ones whose EEG recording does not show any epileptic seizure activity during the recording and those people will be labeled as 0. The people who are grouped in class 1, on the other hand, had some seizure activity during recording and they will be labeled as 1.
y = raw["y"]
e = []
for x in range(0, len(raw)):
    if y[x] in [2, 3, 4, 5]:
        e.append(0)
    else:
        e.append(1)
raw.insert(179, "e", e, True)
raw.head()
Principal Component Analysis for Data Visualization
To get an idea about the distribution of the classes, the data can be visualized after reducing the number of features from 178 to only 2. Principal Component Analysis is one of the most widely used approaches in this dimensionality reduction process and I will use it to visualize the data.
Because the scales of the features are very different, they should be standardized before applying PCA.
from sklearn.preprocessing import StandardScaler
features = raw.columns[0:178]
X = raw.loc[:, features].values
y = raw.loc[:,"y"]
e = raw.loc[:,"e"]
X = StandardScaler().fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, e_train, e_test = train_test_split(X, e, test_size = 0.25, random_state = 21, stratify = e)
from sklearn.decomposition import PCA
import pandas as pd
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook,reset_output,save,show
output_notebook()
def pca_visualization(X, y):
    pca = PCA(n_components = 2)
    pc = pca.fit_transform(X)
    pcDF = pd.DataFrame(data = pc, columns = ['pc1','pc2'])
    pcDF['e'] = y.values
    epilepsy = pcDF.loc[pcDF['e'] == 1, ]
    not_epilepsy = pcDF.loc[pcDF['e'] == 0, ]
    reset_output()
    output_notebook()
    p = figure()
    p.scatter(x = epilepsy['pc1'],
              y = epilepsy['pc2'],
              marker = "x",
              fill_alpha = 0.5,
              size = 6,
              legend = "Epilepsy")
    p.scatter(x = not_epilepsy['pc1'],
              y = not_epilepsy['pc2'],
              marker = "triangle",
              fill_alpha = 0.3,
              size = 6,
              fill_color = "orange",
              line_color = "orange",
              legend = "Not Epilepsy")
    p.xaxis.axis_label = "Principal Component 1"
    p.yaxis.axis_label = "Principal Component 2"
    show(p)
pca_visualization(X_train,e_train)
import numpy as np
rate = np.mean(e_train)
length = len(e_train)
print(f"There are {length} examples in the training set and only {round(length*rate)} of them are labeled as Epilepsy.")
As can be seen from the graph, which is obtained after reducing the number of features with PCA, and the number of positive labels, the data is imbalanced. Putting an imbalanced dataset into a machine learning algorithm directly may result in inaccurate predictions since the algorithm may ignore the minority class. I will handle this problem with oversampling.
Over-Sampling with SMOTE
SMOTE is one of the most commonly used oversampling techniques. It synthesizes new samples for the minority class until each label has an equal number of cases. It is important to note that oversampling should be applied only to the training set; resampling before the train/test split would leak synthetic information into the test set and bias the evaluation.
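Conceptually, SMOTE creates each synthetic point by interpolating between a minority-class sample and one of its nearest minority-class neighbors. A minimal sketch of that idea (a toy illustration, not the imblearn implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(x, neighbor):
    """Create one synthetic point on the segment between x and a minority-class neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (neighbor - x)

x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])
synthetic = smote_like_sample(x, neighbor)
# The synthetic point always lies on the segment between the two originals
assert np.all(synthetic >= np.minimum(x, neighbor))
assert np.all(synthetic <= np.maximum(x, neighbor))
```

The real algorithm additionally searches for the k nearest minority neighbors and repeats this until the classes are balanced, which is what `SMOTE().fit_resample` does below.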
from imblearn.over_sampling import SMOTE
oversampling = SMOTE()
X_train_b, e_train_b = oversampling.fit_resample(X_train,e_train)
from collections import Counter
counter_imb = Counter(e_train)
counter_b = Counter(e_train_b)
print(counter_imb)
print(counter_b)
pca_visualization(X_train_b,e_train_b)
rate = np.mean(e_train_b)
length = len(e_train_b)
print(f"There are {length} examples in the training set after SMOTE and now {round(length*rate)} of them are labeled as Epilepsy.")
Now the data is balanced. Because there is a high number of features, training the models may take a long time. A good way to shorten this is to reduce the number of features with feature selection/extraction methods. In this project, I will obtain three different training sets using three different dimensionality reduction techniques, measure the classification performance in each scenario, and select the training set that yields the highest performance on the test set.
Principal Component Analysis for Speeding Up the Machine Learning Algorithm
Principal Component Analysis is a feature extraction method that is used to reduce the number of features. It transforms possibly correlated features into linearly uncorrelated components.
In this project, I selected the number of principal components as 40 since this number is relatively small and the selected principal components still represent the original data at a good level.
pca = PCA(n_components = 40)
pca.fit(X_train_b)
X_train_reduced_pca = pca.transform(X_train_b)
X_test_reduced_pca = pca.transform(X_test)
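To check that 40 components "still represent the original data at a good level", one can inspect the cumulative explained variance ratio of the fitted PCA. A quick diagnostic (shown on a random stand-in matrix so it runs on its own; with the real data, use the `pca` object fitted above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for the standardized 178-feature training matrix
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 178))

pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(f"Variance explained by the first 40 components: {cumulative[39]:.1%}")
```

The exact percentage depends on the data; on strongly correlated EEG features it will be much higher than on this random matrix.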
To be able to make better comparisons between three different datasets that are obtained after three different dimensionality reduction methods, I wanted these datasets to have the same size. Therefore, I will select the most important 40 features in the other dimensionality reduction methods.
Feature Importance with Extra Tree Classifier
Extra Trees Classifier is an ensemble learning method that is very similar to the Random Forest model. The main differences are that each decision tree is built on the entire dataset (rather than a bootstrap sample) and that the split thresholds are selected at random.
As in other tree ensembles, the features that produce the largest decrease in impurity (the largest increase in purity) across the trees are ranked as the most important.
from sklearn.ensemble import ExtraTreesClassifier
n_runs = 500
feature_imp = np.zeros(X_train_b.shape[1])
for x in range(n_runs):
    model = ExtraTreesClassifier()
    model.fit(X_train_b, e_train_b)
    feature_imp += model.feature_importances_
feature_imp /= n_runs  # average importance over the runs
from operator import itemgetter
from heapq import nlargest, nsmallest
largest40 = nlargest(40, enumerate(feature_imp), itemgetter(1))
largest40_col_name = []
largest40_values = []
largest40_index = []
for index, value in largest40:
    name = 's' + str(index + 1)
    largest40_col_name.append(name)
    largest40_values.append(value)
    largest40_index.append(index)
output_notebook()
p = figure(y_range = largest40_col_name, title = "Most Important 40 Features", x_axis_label = "Importance", tools = "")
p.hbar(y = largest40_col_name, right = largest40_values, left = 0, height = 0.5, color = 'orange', fill_alpha = 0.5)
show(p)
X_train_f_reduced = X_train_b[:,largest40_index]
X_test_f_reduced = X_test[:,largest40_index]
Sequential Forward Feature Selection
Sequential Forward Feature Selection is another common feature selection method that is used to filter the most important features. The algorithm detects the feature that has the highest correlation with the target variable first. In the next step, pairs of features are formed by using the selected feature and one of the remaining features, and the best pair is selected. Then the triplets are formed by using the selected features and one of the remaining features and the best triplet is selected. This goes until reaching a predefined point. The disadvantage of this algorithm is its slow running time.
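The greedy procedure described above can be sketched directly. This toy version scores each candidate subset with cross-validated accuracy, as mlxtend does (an illustration on synthetic data, not the mlxtend implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=3, random_state=0)

selected, remaining = [], list(range(X_demo.shape[1]))
while len(selected) < 3:  # stop at a predefined number of features
    # Score every candidate set formed by adding one remaining feature
    scores = {f: cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                 X_demo[:, selected + [f]], y_demo,
                                 cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)  # keep the best-scoring addition
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
```

The slow running time follows directly from this structure: every round refits the model once per remaining feature, each time with cross-validation.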
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.neighbors import KNeighborsClassifier
mdl = KNeighborsClassifier(n_neighbors = 1)
sequential_forward = sfs(mdl,
                         k_features = 40,
                         forward = True,
                         floating = False,
                         verbose = 2,
                         scoring = 'accuracy',
                         cv = 10)
sequential_forward = sequential_forward.fit(X_train_b, e_train_b)
feat_cols = list(sequential_forward.k_feature_idx_)
print(feat_cols)
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (10,10)
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
fig = plot_sfs(sequential_forward.get_metric_dict(), kind='std_dev')
plt.ylim([0.75, 1])
plt.title('Sequential Forward Selection')
plt.grid()
plt.show()
As can be seen from the graph, the selected 40 features already result in high performance on the validation set. (Because we are doing parameter selection, performance is measured on the validation set.) This means that adding more features would not improve the performance of the model; it would only increase the training time.
X_train_sf_reduced = X_train_b[:,feat_cols]
X_test_sf_reduced = X_test[:,feat_cols]
At the end of these dimensionality reduction methods, I obtained three different training sets with the same number of samples and features. In the next step, I will feed these datasets into exactly the same machine learning algorithm, K-Nearest Neighbors with k = 1, classify the same test set, and compare the dimensionality reduction methods.
K-Nearest Neighbors after Feature Extraction with Principal Component Analysis
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
def model_comparison(X_train, y_train, X_test, y_test, model, parameters, cv, scoring):
    model_cv = GridSearchCV(model, parameters, cv = cv, scoring = scoring)
    model_cv.fit(X_train, y_train)
    prediction = model_cv.predict(X_test)
    prediction_prob = model_cv.predict_proba(X_test)[:,1]
    print("Classification report:\n\n", classification_report(y_test, prediction))
    plot_confusion_matrix(model_cv, X_test, y_test,
                          display_labels = ['Not Epilepsy','Epilepsy'],
                          cmap = plt.cm.Blues,
                          values_format = ".2f")
    return prediction, prediction_prob, model_cv
K-Nearest Neighbors is one of the simplest and most commonly used algorithms in machine learning. It takes the k closest samples into account when predicting the unknown label of a sample. As we decrease k, the model becomes more complex and tends to fit noise, which may result in overfitting; increasing k, on the other hand, may result in underfitting.
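This bias-variance trade-off is easy to see on a small noisy dataset: with k = 1 the model memorizes the training set perfectly, but its test accuracy suffers relative to its training accuracy. A toy demonstration (synthetic data assumed, not the EEG data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# flip_y injects 20% label noise so that memorization is harmful
Xd, yd = make_classification(n_samples=400, n_features=10, flip_y=0.2,
                             random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xd, yd, random_state=0)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    print(f"k={k:2d}  train={knn.score(Xtr, ytr):.2f}  test={knn.score(Xte, yte):.2f}")
```

With k = 1 the training accuracy is 1.0 (each point is its own nearest neighbor), while larger k values smooth the decision boundary and narrow the train/test gap.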
In general, I will use the F1 score during hyperparameter optimization to obtain higher precision and recall values on the test set.
from sklearn.neighbors import KNeighborsClassifier
param_knn = {'n_neighbors':np.arange(1,5)}
knn_prediction_pca, knn_prediction_prob_pca, knn_pca = model_comparison(
    X_train = X_train_reduced_pca,
    y_train = e_train_b,
    X_test = X_test_reduced_pca,
    y_test = e_test,
    model = KNeighborsClassifier(),
    parameters = param_knn,
    cv = 10, scoring = 'f1')
K-Nearest Neighbors after Feature Selection with Extra Trees Classifier
knn_prediction_f, knn_prediction_prob_f, knn_f = model_comparison(
    X_train = X_train_f_reduced,
    y_train = e_train_b,
    X_test = X_test_f_reduced,
    y_test = e_test,
    model = KNeighborsClassifier(),
    parameters = param_knn,
    cv = 10, scoring = 'f1')
K-Nearest Neighbors after Feature Selection with Sequential Forward Feature Selection
param_knn = {'n_neighbors':np.arange(1,5)}
knn_prediction_sf, knn_prediction_prob_sf, knn_sf = model_comparison(
    X_train = X_train_sf_reduced,
    y_train = e_train_b,
    X_test = X_test_sf_reduced,
    y_test = e_test,
    model = KNeighborsClassifier(),
    parameters = param_knn,
    cv = 10, scoring = 'f1')
When we compare the performance of the three classifications, all three dimensionality reduction techniques give very similar results. The differences between the predictions are most likely due to randomness. Since PCA has the lowest running time, I will use the PCA-reduced data for the rest of the project.
Decision Tree
Decision Tree is one of the simplest and most interpretable algorithms in machine learning. The algorithm asks a sequence of questions and makes classifications based on the answers. It has many hyperparameters; I chose the most common ones for hyperparameter optimization.
from sklearn.tree import DecisionTreeClassifier
param_dtc = {'max_depth': [200,250,300,1000],
             'max_features': [25,30,35],
             'min_samples_split': [5,10,20,30,40,50],
             'min_samples_leaf': [25,50,150,250,500,1000]}
dtc_prediction_pca, dtc_prediction_prob_pca, dtc_pca = model_comparison(
    X_train = X_train_reduced_pca,
    y_train = e_train_b,
    X_test = X_test_reduced_pca,
    y_test = e_test,
    model = DecisionTreeClassifier(),
    parameters = param_dtc,
    cv = 10,
    scoring = 'f1')
Random Forest
Random Forest is simply a combination of multiple decision trees. A single Decision Tree is built using the entire dataset and all the features. In Random Forest, however, random samples are drawn from the data and Decision Trees are built on these samples until a predefined number of trees is reached.
The parameters of Random Forest are similar to those of the Decision Tree. Because running the Random Forest model on the training data takes a lot of time, I used only three parameters for hyperparameter optimization. The 'max_features' and 'max_depth' parameters work like their Decision Tree counterparts.
Note: With F1 scoring, the Random Forest's recall score was very high but its precision score was very low, and switching the scoring to precision did not hurt the recall score. I therefore used precision for hyperparameter optimization in Random Forest to obtain a more precise model.
from sklearn.ensemble import RandomForestClassifier
param_rf = {'n_estimators': [300,500,700],
            'max_features': [3,5,7],
            'max_depth': [80,100,120]}
rf_prediction_pca, rf_prediction_prob_pca, rf_pca = model_comparison(
    X_train = X_train_reduced_pca,
    y_train = e_train_b,
    X_test = X_test_reduced_pca,
    y_test = e_test,
    model = RandomForestClassifier(),
    parameters = param_rf,
    cv = 10, scoring = 'precision')
Neural Network
Neural Network is a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs. Neural networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes' which contain an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates to one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is output. [1]
As in other machine learning algorithms, the main goal is to minimize the difference between the predicted and actual values. After building a Neural Network, this difference is expressed as a loss function, and various algorithms search for the parameters that minimize it. These algorithms are called optimizers.
In this project, I preferred the Adam optimizer, which can be seen as a combination of the AdaGrad and RMSProp optimizers.
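For intuition, each Adam update combines an exponentially decaying average of past gradients with a decaying average of past squared gradients, with a bias correction for the early steps. A minimal NumPy sketch of the update rule, using the standard default constants (an illustration, not the Keras implementation):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters theta given gradient grad at step t."""
    m = b1 * m + (1 - b1) * grad        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction (moments start at zero)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x**2 starting from x = 1.0
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    grad = 2 * theta                    # gradient of x**2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # approaches 0
```

Dividing by the square root of the second moment gives each parameter its own adaptive step size, which is the property Adam inherits from AdaGrad/RMSProp.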
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
from sklearn import metrics
from sklearn.model_selection import KFold
from keras.layers import Activation
from keras.optimizers import Adam
X_reduced_pca = np.concatenate((X_train_reduced_pca,X_test_reduced_pca))
e_reduced_pca = np.concatenate((e_train_b, e_test))
kf_cv = KFold(10, shuffle=True, random_state = 42)
def tuning_parameters(learning_rate, epochs):
    oos_y = []     # out-of-sample targets, collected across folds
    oos_pred = []  # out-of-sample predictions
    for train, test in kf_cv.split(X_reduced_pca):
        x_train = X_reduced_pca[train]
        y_train = e_reduced_pca[train]
        x_test = X_reduced_pca[test]
        y_test = e_reduced_pca[test]
        nn = Sequential()
        nn.add(Dense(100, input_dim = 40, activation = 'relu'))
        nn.add(Dense(100, activation = 'relu'))
        nn.add(Dense(1, activation = 'sigmoid'))
        opt = Adam(lr = learning_rate)
        nn.compile(loss='binary_crossentropy', optimizer=opt, metrics=tf.keras.metrics.RecallAtPrecision(precision=0.9))
        nn.fit(x_train, y_train, validation_data=(x_test, y_test), epochs = epochs, verbose = 0)
        y_pred = nn.predict_classes(x_test)
        oos_y.append(y_test)
        oos_pred.append(y_pred)
    oos_y_concat = np.concatenate(oos_y)
    oos_pred_concat = np.concatenate(oos_pred)
    f1_score = metrics.f1_score(oos_y_concat, oos_pred_concat)
    rec = metrics.recall_score(oos_y_concat, oos_pred_concat)
    prec = metrics.precision_score(oos_y_concat, oos_pred_concat)
    return f1_score, rec, prec
learning_rates = [0.0001,0.001, 0.01, 0.1, 1]
f1 = []
rec = []
prec = []
for i in range(len(learning_rates)):
    x, y, z = tuning_parameters(learning_rate = learning_rates[i], epochs = 100)
    f1.append(x)
    rec.append(y)
    prec.append(z)
nn_pca = Sequential()
nn_pca.add(Dense(1200, input_dim = 40, activation = 'relu'))
nn_pca.add(Dense(1200, activation = 'relu'))
nn_pca.add(Dense(1200, activation = 'relu'))
nn_pca.add(Dense(1, activation = 'sigmoid'))
opt = Adam(lr = 0.001)
nn_pca.compile(loss='binary_crossentropy', optimizer=opt, metrics=tf.keras.metrics.Recall())
nn_pca.fit(X_train_reduced_pca,e_train_b,validation_data=(X_test_reduced_pca,e_test), epochs = 30, verbose = 0)
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix as plot_cm
nn_prediction_prob_pca = nn_pca.predict(X_test_reduced_pca)
nn_prediction_prob_pca_cm = (nn_prediction_prob_pca > 0.5)
nn_prediction_pca = nn_pca.predict_classes(X_test_reduced_pca)
cm = confusion_matrix(e_test, nn_prediction_prob_pca_cm)
class_names = ['Not Epilepsy', 'Epilepsy']
fig, ax = plot_cm(conf_mat = cm,
                  colorbar = True,
                  class_names = class_names,
                  figsize = (6,6))
Model Selection
After applying a range of algorithms, the next step is to compare them among each other from different perspectives and select the one to make a prediction. Because the data we used to make a prediction is imbalanced, I preferred to choose precision and recall scores as one of my criteria in model selection.
from sklearn.metrics import precision_recall_curve, plot_precision_recall_curve, f1_score
import matplotlib.pyplot as plt
precision_knn, recall_knn, _ = precision_recall_curve(e_test,knn_prediction_prob_pca)
precision_dtc, recall_dtc, _ = precision_recall_curve(e_test,dtc_prediction_prob_pca)
precision_rf, recall_rf, _ = precision_recall_curve(e_test,rf_prediction_prob_pca)
precision_nn, recall_nn, _ = precision_recall_curve(e_test,nn_prediction_prob_pca)
f1_knn = f1_score(e_test, knn_prediction_pca)
f1_dtc = f1_score(e_test, dtc_prediction_pca)
f1_rf = f1_score(e_test, rf_prediction_pca)
f1_nn = f1_score(e_test, nn_prediction_pca)
fig = plt.figure(figsize= (10,10))
ax = fig.add_subplot(1,1,1)
ax.plot(recall_knn, precision_knn, label = "K-Nearest Neighbors (F1 = {:.5f})".format(f1_knn), c = 'gray')
ax.plot(recall_dtc, precision_dtc, label = "Decision Tree (F1 = {:.5f})".format(f1_dtc), c = 'orange')
ax.plot(recall_rf, precision_rf, label = "Random Forest (F1 = {:.5f})".format(f1_rf), c = 'purple')
ax.plot(recall_nn, precision_nn, label = "Neural Network (F1 = {:.5f})".format(f1_nn), c = 'green')
ax.grid()
ax.legend()
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title("Precision Recall Curve for Different Models")
plt.show()
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score, precision_score
def specificity_score(e_act, e_pred, threshold):
    return sum((e_pred < threshold) & (e_act == 0)) / sum(e_act == 0)

def report(e_act, e_pred, threshold):
    precision = precision_score(e_act, (e_pred > threshold))
    recall = recall_score(e_act, (e_pred > threshold))
    specificity = specificity_score(e_act, e_pred, threshold)
    accuracy = accuracy_score(e_act, (e_pred > threshold))
    print('Precision:%.5f' % precision)
    print('Recall:%.5f' % recall)
    print('Specificity:%.5f' % specificity)
    print('Accuracy:%.5f' % accuracy)
    return precision, recall, specificity, accuracy
print("The performance of the models on the test set are summarised in the table below:\n")
print("\033[1m" + "K-Nearest Neighbors"+ '\033[0m')
precision_knn, recall_knn, specificity_knn, accuracy_knn = report(e_test, knn_prediction_prob_pca, threshold = 0.5)
print("\n")
print("\033[1m" + "Decision Tree"+ '\033[0m')
precision_dtc, recall_dtc, specificity_dtc, accuracy_dtc = report(e_test, dtc_prediction_prob_pca, threshold = 0.5)
print("\n")
print("\033[1m" + "Random Forest"+ '\033[0m')
precision_rf, recall_rf, specificity_rf, accuracy_rf = report(e_test, rf_prediction_prob_pca, threshold = 0.5)
print("\n")
print("\033[1m" + "Neural Network"+ '\033[0m')
precision_nn, recall_nn, specificity_nn, accuracy_nn = report(e_test, nn_prediction_prob_pca.flatten(), threshold = 0.5)
performance = (precision_knn, precision_dtc, precision_rf, precision_nn,
               recall_knn, recall_dtc, recall_rf, recall_nn,
               specificity_knn, specificity_dtc, specificity_rf, specificity_nn,
               accuracy_knn, accuracy_dtc, accuracy_rf, accuracy_nn)
metrics = ["Precision", "Recall", "Specificity", "Accuracy"]
models = ["K-Nearest Neighbors", "Decision Tree","Random Forest", "Neural Network"]
stacked = [(metric,model) for metric in metrics for model in models]
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure
from bokeh.palettes import Viridis4
from bokeh.models import LabelSet
from bokeh.transform import factor_cmap
reset_output()
output_notebook()
x = stacked
y = np.round(np.array(performance)*100, 2)
source = ColumnDataSource(data=dict(x=x, y=y))
p = figure(x_range = FactorRange(*x),
           plot_height = 400,
           plot_width = 1200,
           title = "Model Comparison from Different Perspectives",
           toolbar_location = None,
           tools = "")
p.vbar(x = 'x',
       top = 'y',
       width = 0.9,
       source = source,
       fill_color = factor_cmap('x',
                                palette = Viridis4,  # four colors, one per model
                                factors = models,
                                start = 1,
                                end = 3))
labels = LabelSet(x = 'x',
                  y = 'y',
                  text = 'y',
                  level = 'glyph',
                  x_offset = -13,
                  y_offset = 0,
                  source = source,
                  text_font_size = "11px",
                  render_mode = 'canvas')
p.add_layout(labels)
p.title.text_font_size = "12px"
p.y_range.start = 0
p.x_range.range_padding = 0.1
p.yaxis.axis_label = 'Performance(%)'
p.xaxis.major_label_orientation = 1
show(p)
When we look at the graph above, we see that the Random Forest model predicted 99.3% of the cases with epileptic seizure and 96.04% of seizure-free cases correctly. However, it is also important to note that 13.75% of the 'epilepsy' predictions are incorrect with this model. In other words, the precision score of the model is relatively low.
The K-Nearest Neighbors model, on the other hand, is more precise: 98.36% of its 'epilepsy' predictions are correct. Besides this, 99.65% of the seizure-free cases are predicted correctly with this model. Looking at the recall score, however, only 83.3% of the cases with epileptic seizure are predicted correctly, which is far from the desired level.
In general, several factors are weighed against each other when selecting a model. Because predicting people who have epilepsy correctly is the most important criterion here, I selected the Random Forest model to make predictions, even though it is a relatively complex and less interpretable model.
Analysis of Incorrect Predictions
After making predictions on a new dataset with the Random Forest model, the next step can be to analyze these predictions.
As can be seen from the confusion matrix, 91 cases are predicted as having epilepsy incorrectly even though there was no sign of seizure during the EEG recording in those cases. In other words, there are 91 false positives. This is a relatively high number. To be able to better understand why these cases are predicted as having epilepsy, I will compare the distributions of false positives with the distributions of true negatives from two different perspectives.
incorrect_ind = rf_pca.predict(X_test_reduced_pca)!=e_test
correct_ind = rf_pca.predict(X_test_reduced_pca)==e_test
incorrect_df = pd.DataFrame(X_test[incorrect_ind])
incorrect_df["e"] = e_test.values[incorrect_ind]
correct_df = pd.DataFrame(X_test[correct_ind])
correct_df["e"] = e_test.values[correct_ind]
fp = incorrect_df[incorrect_df["e"]==0]
fn = incorrect_df[incorrect_df["e"]==1]
tn = correct_df[correct_df["e"]==0]
tp = correct_df[correct_df["e"]==1]
tn.T.boxplot(figsize = (30,300),vert=False)
plt.xlim(-7.75,13)
plt.show()
The plot above is the distribution of the true negative samples.
fp.T.boxplot(figsize = (30,13),vert=False)
plt.xlim(-7.75,13)
plt.show()
The plot above is the distribution of the false positive samples.
There is a clear difference between the distribution of false positives and true negatives. There are many more outliers in the distribution of false positive samples. Also, IQR (interquartile range) is larger for the cases that are predicted as having epilepsy even though there was no sign of epileptic seizures during the EEG recording.
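The visual difference can also be quantified by counting, per sample, the points that fall outside the usual 1.5×IQR box-plot whiskers. A sketch of that check (shown on toy heavy-tailed vs. normal data so it runs standalone; with the real data, apply `outlier_count` to the rows of `fp` and `tn`):

```python
import numpy as np
import pandas as pd

def outlier_count(row):
    """Count values outside the 1.5*IQR whiskers of this row."""
    q1, q3 = np.percentile(row, [25, 75])
    iqr = q3 - q1
    return int(((row < q1 - 1.5 * iqr) | (row > q3 + 1.5 * iqr)).sum())

# Toy stand-ins: heavy-tailed rows mimic the false positives,
# normally distributed rows mimic the true negatives
rng = np.random.default_rng(0)
fp_demo = pd.DataFrame(rng.standard_t(df=2, size=(20, 178)))
tn_demo = pd.DataFrame(rng.normal(size=(20, 178)))

print("mean outliers per FP-like sample:", fp_demo.apply(outlier_count, axis=1).mean())
print("mean outliers per TN-like sample:", tn_demo.apply(outlier_count, axis=1).mean())
```

A consistently higher count for the false-positive rows would make the "many more outliers" observation concrete instead of purely visual.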
tn.boxplot(figsize = (30,30),vert=False)
plt.xlim(-7.85,13.90)
plt.show()
The plot above is the distribution of features of the true negative samples.
fp.boxplot(figsize = (30,30),vert=False)
plt.xlim(-7.85,13.90)
plt.show()
The plot above is the distribution of features of the false positive samples.
Like in the distribution of samples, there is also a clear difference between the distribution of features of false positives and true negatives. There are many outliers and IQR is larger in the features of false positives.
Because I cannot read EEG recordings and have no domain knowledge, I did not remove outliers in the data preprocessing phase at the beginning of the project: an outlier could itself be a sign of an epileptic seizure, and I did not want to discard that information.
However, after seeing how different the distributions of false positives and true negatives are, we can screen samples and features by taking outliers and the IQR into account before making a prediction. If the distribution of a sample/feature fits the distribution of the false positives, we can repeat the EEG recording or adjust the outliers in that sample/feature.